MAB algorithm
A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms
Multi-armed bandit (MAB) problems serve as a fundamental building block for more complex reinforcement learning algorithms. However, evaluating and comparing MAB algorithms remains challenging due to the lack of standardized conditions and replicability. This is particularly problematic for variance-aware extensions of classical methods like UCB, whose performance can heavily depend on the underlying environment. In this study, we address how performance differences between bandit algorithms can be reliably observed, and under what conditions variance-aware algorithms outperform classical ones. We present a reproducible evaluation designed to systematically compare eight classical and variance-aware MAB algorithms. The evaluation framework, implemented in our Bandit Playground codebase, features clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, and action optimality), and an interactive evaluation interface that supports consistent and transparent analysis. We show that variance-aware algorithms can offer advantages in settings with high uncertainty where the difficulty arises from subtle differences between arm rewards. In contrast, classical algorithms often perform equally well or better in more separable scenarios or if fine-tuned extensively. Our contributions are twofold: (1) a framework for systematic evaluation of MAB algorithms, and (2) insights into the conditions under which variance-aware approaches outperform their classical counterparts.
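To make the comparison concrete, here is a minimal sketch (in the spirit of, but not taken from, the Bandit Playground codebase) contrasting classical UCB1 with a UCB-V-style rule whose exploration bonus uses the empirical variance; the arm means mimic the subtly separated regime described above, and all constants are illustrative.

```python
# Minimal sketch: UCB1 vs. a variance-aware UCB-V-style rule on Bernoulli arms.
# Constants and arm means are illustrative, not taken from the paper.
import numpy as np

def run_ucb(means, horizon, variance_aware=False, seed=0):
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    sq_sums = np.zeros(k)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            mu = sums / counts
            if variance_aware:
                var = np.maximum(sq_sums / counts - mu ** 2, 0.0)
                bonus = np.sqrt(2 * var * np.log(t) / counts) + 3 * np.log(t) / counts
            else:
                bonus = np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(mu + bonus))
        reward = float(rng.random() < means[arm])
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward ** 2
        regret += max(means) - means[arm]
    return regret

means = [0.50, 0.52, 0.48]  # subtle differences between arm rewards
print("UCB1 regret :", run_ucb(means, 20000))
print("UCB-V regret:", run_ucb(means, 20000, variance_aware=True))
```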
Reinforcement Learning for Search Tree Size Minimization in Constraint Programming: New Results on Scheduling Benchmarks
Heinz, Vilém, Vilím, Petr, Hanzálek, Zdeněk
Failure-Directed Search (FDS) is a complete, generic search algorithm used in Constraint Programming (CP) to explore the search space efficiently; it has proven particularly effective on scheduling problems. This paper analyzes FDS's properties and shows that minimizing the size of its search tree, guided by ranked branching decisions, is closely related to the multi-armed bandit (MAB) problem. Building on this insight, MAB reinforcement learning algorithms are applied to FDS, extended with problem-specific refinements and parameter tuning, and evaluated on the two most fundamental scheduling problems: the Job Shop Scheduling Problem (JSSP) and the Resource-Constrained Project Scheduling Problem (RCPSP). The resulting enhanced FDS, using the best extended MAB algorithm and configuration, runs 1.7 times faster on the JSSP and 2.1 times faster on the RCPSP benchmarks than the original implementation in a new solver called OptalCP, and 3.5 times faster on the JSSP and 2.1 times faster on the RCPSP benchmarks than the current state-of-the-art FDS algorithm in IBM CP Optimizer 22.1. Furthermore, using only a 900-second time limit per instance, the enhanced FDS improved the existing state-of-the-art lower bounds of 78 of 84 JSSP and 226 of 393 RCPSP standard open benchmark instances, while also completely closing a few of them.
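To illustrate the connection, the sketch below treats candidate branching decisions as bandit arms and rates them with a UCB rule; the decision set and the "fails fast" reward signal are hypothetical stand-ins, not FDS's actual rating scheme or the paper's refinements.

```python
# Hedged sketch: branching decisions as bandit arms, rated by a UCB rule.
# The decision set and the fail-fast reward probabilities are hypothetical.
import math
import random

class BranchingBandit:
    def __init__(self, decisions):
        self.counts = {d: 0 for d in decisions}
        self.values = {d: 0.0 for d in decisions}
        self.t = 0

    def select(self):
        self.t += 1
        def ucb(d):
            if self.counts[d] == 0:
                return float("inf")  # try every decision at least once
            return self.values[d] + math.sqrt(2 * math.log(self.t) / self.counts[d])
        return max(self.counts, key=ucb)

    def update(self, decision, reward):
        self.counts[decision] += 1
        # incremental mean of observed rewards
        self.values[decision] += (reward - self.values[decision]) / self.counts[decision]

random.seed(0)
fail_fast_prob = {"x<5": 0.7, "x>=5": 0.2, "y<3": 0.5, "y>=3": 0.3}
bandit = BranchingBandit(list(fail_fast_prob))
for _ in range(500):
    d = bandit.select()
    reward = 1.0 if random.random() < fail_fast_prob[d] else 0.0
    bandit.update(d, reward)
print(sorted(bandit.counts.items(), key=lambda kv: -kv[1]))
```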
Adversarial Bandit over Bandits: Hierarchical Bandits for Online Configuration Management
Avin, Chen, Lotker, Zvi, Mannor, Shie, Shabat, Gil, Shteingart, Hanan, Yadgar, Roey
Motivated by dynamic parameter optimization in finite but large action (configuration) spaces, this work studies the nonstochastic multi-armed bandit (MAB) problem in metric action spaces with oblivious Lipschitz adversaries. We propose ABoB, a hierarchical Adversarial Bandit over Bandits algorithm that can use state-of-the-art existing "flat" algorithms but additionally clusters similar configurations to exploit local structure and adapt to changing environments. We prove that in the worst case such a clustering approach cannot hurt much: ABoB guarantees a standard worst-case regret bound of $O\left(k^{\frac{1}{2}}T^{\frac{1}{2}}\right)$, where $T$ is the number of rounds and $k$ is the number of arms, matching the traditional flat approach. However, under favorable conditions on the algorithm, the clusters, and certain Lipschitz properties, the regret bound improves to $O\left(k^{\frac{1}{4}}T^{\frac{1}{2}}\right)$. Simulations and experiments on a real storage system demonstrate that ABoB, using standard algorithms such as EXP3 and Tsallis-INF, achieves lower regret and faster convergence than the flat method, with up to a 50% improvement in previously studied setups, both nonstochastic and stochastic, as well as in our own settings.
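The two-level structure is simple to sketch: a top-level EXP3 picks a cluster, and a per-cluster EXP3 picks a configuration inside it. The clusters below are hard-coded for illustration; ABoB's clustering and regret analysis are in the paper.

```python
# Minimal sketch of a bandit-over-bandits: EXP3 over clusters, EXP3 within each.
import math
import random

class EXP3:
    def __init__(self, k, gamma=0.1):
        self.k, self.gamma = k, gamma
        self.w = [1.0] * k

    def probs(self):
        total = sum(self.w)
        return [(1 - self.gamma) * wi / total + self.gamma / self.k for wi in self.w]

    def draw(self):
        return random.choices(range(self.k), weights=self.probs())[0]

    def update(self, i, reward):
        p = self.probs()[i]
        self.w[i] *= math.exp(self.gamma * reward / (p * self.k))
        m = max(self.w)
        self.w = [wi / m for wi in self.w]  # renormalize to avoid overflow

random.seed(1)
clusters = [[0.20, 0.25], [0.80, 0.75], [0.50, 0.55]]  # mean reward per configuration
top = EXP3(len(clusters))
inner = [EXP3(len(c)) for c in clusters]
for _ in range(5000):
    ci = top.draw()        # choose a cluster
    ai = inner[ci].draw()  # choose a configuration within it
    r = float(random.random() < clusters[ci][ai])
    inner[ci].update(ai, r)
    top.update(ci, r)
print("top-level cluster probabilities:", [round(p, 2) for p in top.probs()])
```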
Large Language Model-Enhanced Multi-Armed Bandits
Sun, Jiahang, Wang, Zhiyong, Yang, Runhan, Xiao, Chenjun, Lui, John C. S., Dai, Zhongxiang
Large language models (LLMs) have been adopted to solve sequential decision-making tasks such as multi-armed bandits (MAB), in which an LLM is directly instructed to select the arms to pull in every iteration. However, this paradigm of direct arm selection using LLMs has been shown to be suboptimal in many MAB tasks. Therefore, we propose an alternative approach that combines the strengths of classical MAB algorithms and LLMs. Specifically, we adopt a classical MAB algorithm as the high-level framework and leverage the strong in-context learning capability of LLMs to perform the sub-task of reward prediction. First, we incorporate the LLM-based reward predictor into the classical Thompson sampling (TS) algorithm and adopt a decaying schedule for the LLM temperature to ensure a transition from exploration to exploitation. Next, we incorporate the LLM-based reward predictor (with a temperature of 0) into a regression oracle-based MAB algorithm equipped with an explicit exploration mechanism. We also extend our TS-based algorithm to dueling bandits, where only preference feedback between pairs of arms is available, which requires non-trivial algorithmic modifications. We conduct empirical evaluations using both synthetic MAB tasks and experiments designed around real-world text datasets; the results show that our algorithms consistently outperform previous baseline methods based on direct arm selection. Interestingly, we also demonstrate that in challenging tasks where the arms lack semantic meanings that the LLM could exploit, our approach achieves considerably better performance than LLM-based direct arm selection.
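A hedged sketch of the TS-based variant follows: `llm_predict_reward` is a hypothetical stub standing in for the LLM's in-context reward prediction, with additive noise scaled by a decaying `temperature` playing the role of the LLM temperature schedule.

```python
# Hedged sketch: Thompson-sampling-style loop with a stubbed LLM reward predictor.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.6, 0.5])

def llm_predict_reward(history, arm, temperature):
    # Stub: noisy estimate from the arm's empirical mean. In the paper this
    # prediction comes from an LLM conditioned on the interaction history.
    pulls = [r for a, r in history if a == arm]
    mean = float(np.mean(pulls)) if pulls else 0.5
    return mean + temperature * rng.normal()

history = []
for t in range(1, 501):
    temperature = 1.0 / np.sqrt(t)  # decaying schedule: explore early, exploit late
    scores = [llm_predict_reward(history, a, temperature) for a in range(len(true_means))]
    arm = int(np.argmax(scores))
    reward = float(rng.random() < true_means[arm])
    history.append((arm, reward))
print("pulls per arm:", np.bincount([a for a, _ in history], minlength=len(true_means)))
```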
Minimizing Queue Length Regret for Arbitrarily Varying Channels
Krishnakumar, G, Sinha, Abhishek
We consider an online channel scheduling problem for a single transmitter-receiver pair equipped with $N$ arbitrarily varying wireless channels. The transmission rates of the channels might be non-stationary and could be controlled by an oblivious adversary. At every slot, incoming data arrives at an infinite-capacity data queue located at the transmitter. A scheduler, which is oblivious to the current channel rates, selects one of the $N$ channels for transmission. At the end of the slot, the scheduler only gets to know the transmission rate of the selected channel. The objective is to minimize the queue length regret, defined as the difference between the queue length at some time $T$ achieved by an online policy and the queue length obtained by always transmitting over the single best channel in hindsight. We propose a weakly adaptive Multi-Armed Bandit (MAB) algorithm for minimizing the queue length regret in this setup. Unlike previous works, we do not make any stability assumptions about the queue or the arrival process. Hence, our result holds even when the queueing process is unstable. Our main observation is that the queue length regret can be upper bounded by the regret of a MAB policy that competes against the best channel in hindsight uniformly over all sub-intervals of $[T]$. As a technical contribution of independent interest, we then propose a weakly adaptive adversarial MAB policy which achieves $\tilde{O}(\sqrt{N}T^{\frac{3}{4}})$ regret with high probability, implying the same bound for queue length regret.
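The setting is easy to simulate. In the sketch below, plain EXP3 stands in for the paper's weakly adaptive policy (which additionally competes over all sub-intervals of $[T]$), and a time-varying rate function stands in for the oblivious adversary.

```python
# Minimal sketch: arrivals join a queue, a bandit scheduler picks one of N
# channels per slot, and only the chosen channel's rate is observed.
import math
import random

random.seed(0)
N, T = 3, 5000

def rates(t):  # oblivious, time-varying channel rates (illustrative)
    return [1.0, 0.5 + 0.5 * math.sin(t / 500), 0.3]

arrivals = [random.random() for _ in range(T)]

w, gamma, queue = [1.0] * N, 0.05, 0.0
for t in range(T):
    total = sum(w)
    probs = [(1 - gamma) * wi / total + gamma / N for wi in w]
    ch = random.choices(range(N), weights=probs)[0]
    r = rates(t)[ch]  # bandit feedback: chosen channel only
    w[ch] *= math.exp(gamma * r / (probs[ch] * N))
    queue = max(queue + arrivals[t] - r, 0.0)

def fixed_queue(i):  # queue length when always transmitting over channel i
    q = 0.0
    for t in range(T):
        q = max(q + arrivals[t] - rates(t)[i], 0.0)
    return q

best = min(fixed_queue(i) for i in range(N))  # best channel in hindsight
print(f"queue length regret at T: {queue - best:.1f}")
```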
Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
Donâncio, Henrique, Barrier, Antoine, South, Leah F., Forbes, Florence
Reinforcement Learning (RL), when combined with function approximators such as Artificial Neural Networks (ANNs), has shown success in learning policies that outperform humans in complex games by leveraging extensive datasets (see, e.g., 33, 19, 39, 40). While ANNs were previously used as value function approximators [29], the introduction of Deep Q-Networks (DQN) by [24, 25] marked a significant breakthrough by improving learning stability through two mechanisms: the target network and experience replay. The experience replay (see 22) stores the agent's interactions within the environment, allowing sampling of past interactions in a random way that disrupts their correlation. The target network further stabilizes the learning process by periodically copying the parameters of the learning network. This strategy is crucial because the Bellman update --using estimations to update other estimations-- would otherwise occur using the same network, potentially causing divergence. By leveraging the target network, gradient steps are directed towards a periodically fixed target, ensuring more stability in the learning process. Additionally, the learning rate hyperparameter controls the magnitude of these gradient steps in optimizers such as the stochastic gradient descent algorithm, affecting the training convergence. The learning rate is one of the most important hyperparameters, with previous work demonstrating that decreasing its value during policy finetuning can enhance performance by up to 25% in vanilla DQN [3].
Reviews: Bandit Learning with Positive Externalities
The paper studies the interesting problem of learning with externalities in a multi-armed bandit (MAB) setting. The main idea is that there may be a bias in the preferences of users arriving at online platforms. Specifically, future users are likely to have preferences similar to those of users who previously accessed the same platform and were satisfied with the service. Since some online platforms use MAB algorithms to optimize their service, the authors propose the Balanced Exploration (BE) MAB algorithm, which has a structured exploration strategy that takes this potential "future user preference bias" (referred to as "positive externalities") into account. The bias in user preferences translates directly into user-specific reward values: of the m possible items/arms, each user prefers a subset of them (the reward for a preferred arm is a Bernoulli reward with mean proportional to the arm's popularity), while the rewards of all other arms are always zero.
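The reward model is concrete enough to simulate. The sketch below pairs it with a uniform-random placeholder policy and an illustrative popularity update; it is not the paper's Balanced Exploration algorithm, only the externality dynamic BE is designed for.

```python
# Hedged sketch of the reward model: each arriving user prefers a subset of
# arms with probability tied to accumulated popularity; a preferred arm pays a
# Bernoulli reward, all other arms pay zero.
import random

random.seed(0)
m = 3
base_quality = [0.6, 0.5, 0.4]  # hypothetical per-arm success probabilities
popularity = [1.0] * m          # past successes bias future users' preferences

def user_preferences():
    total = sum(popularity)
    return {a for a in range(m) if random.random() < popularity[a] / total}

for _ in range(2000):
    prefs = user_preferences()
    arm = random.randrange(m)   # placeholder policy; BE structures this exploration
    if arm in prefs and random.random() < base_quality[arm]:
        popularity[arm] += 1    # positive externality: success attracts similar users
print("final popularity:", popularity)
```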
A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control
Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, Martin J. Wainwright
We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g., when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages against various alternatives over time. Our framework replaces a sequence of A/B tests with a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. By interleaving the MAB tests with an online false discovery rate (FDR) algorithm, we obtain the best of both worlds: low sample complexity and anytime online FDR control. Our main contributions are: (i) proposing reasonable definitions of a null hypothesis for MAB instances; (ii) demonstrating how to derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) showing that using the rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms yields sample optimality, high power, and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.
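The interleaving is easy to sketch: an online FDR procedure allocates a rejection threshold alpha_j to each experiment, which doubles as the confidence level of a best-arm test. Both the alpha schedule (a simplified alpha-investing rule) and the crude uniform-sampling best-arm test below are stand-ins for the paper's actual procedures.

```python
# Minimal sketch: online FDR thresholds reused as confidence levels for
# best-arm tests. The alpha schedule and the test itself are simplified.
import numpy as np

rng = np.random.default_rng(0)

def best_arm_test(means, alpha, budget=4000):
    """Reject the null ("no arm beats the control, arm 0") if some arm's
    (1 - alpha) lower confidence bound clears the control's upper bound."""
    k = len(means)
    pulls = rng.random((budget // k, k)) < np.array(means)
    mu = pulls.mean(axis=0)
    width = np.sqrt(np.log(2 * k / alpha) / (2 * pulls.shape[0]))  # Hoeffding
    return bool((mu[1:] - width > mu[0] + width).any())

wealth, alpha0, rejections = 0.05, 0.05, 0
experiments = [[0.5, 0.5], [0.5, 0.65], [0.5, 0.5], [0.5, 0.6]]  # arm 0 = control
for j, means in enumerate(experiments, start=1):
    alpha_j = min(wealth / 2, alpha0)  # spend half the remaining wealth
    wealth -= alpha_j
    if best_arm_test(means, alpha_j):
        rejections += 1
        wealth += alpha0  # a discovery earns wealth back (alpha-investing style)
    print(f"experiment {j}: alpha={alpha_j:.4f}, total rejections={rejections}")
```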